Customer Segmentation Using K-Means Clustering

In [ ]:
# handle table-like data and matrices
import pandas as pd
import numpy as np

# visualisation
import seaborn as sns
import matplotlib.pyplot as plt
import missingno as msno
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import plotly.figure_factory as ff

# preprocessing
from sklearn.preprocessing import StandardScaler

# pca
from sklearn.decomposition import PCA

# clustering
from yellowbrick.cluster import KElbowVisualizer
from sklearn.cluster import KMeans, AgglomerativeClustering

# evaluations
from sklearn.metrics import confusion_matrix

# ignore warnings
import warnings
warnings.filterwarnings('ignore')

# to display the total number columns present in the dataset
pd.set_option('display.max_columns', None)

Import Data and handle missing values

Dataset Features¶

  • ID: Customer's unique identifier
  • Year_Birth: Customer's birth year
  • Education: Customer's education level
  • Marital_Status: Customer's marital status
  • Income: Customer's yearly household income
  • Kidhome: Number of children in customer's household
  • Teenhome: Number of teenagers in customer's household
  • Dt_Customer: Date of customer's enrollment with the company
  • Recency: Number of days since customer's last purchase
  • Complain: 1 if the customer complained in the last 2 years, 0 otherwise
  • MntWines: Amount spent on wine in last 2 years
  • MntFruits: Amount spent on fruits in last 2 years
  • MntMeatProducts: Amount spent on meat in last 2 years
  • MntFishProducts: Amount spent on fish in last 2 years
  • MntSweetProducts: Amount spent on sweets in last 2 years
  • MntGoldProds: Amount spent on gold in last 2 years
  • NumDealsPurchases: Number of purchases made with a discount
  • AcceptedCmp1: 1 if customer accepted the offer in the 1st campaign, 0 otherwise
  • AcceptedCmp2: 1 if customer accepted the offer in the 2nd campaign, 0 otherwise
  • AcceptedCmp3: 1 if customer accepted the offer in the 3rd campaign, 0 otherwise
  • AcceptedCmp4: 1 if customer accepted the offer in the 4th campaign, 0 otherwise
  • AcceptedCmp5: 1 if customer accepted the offer in the 5th campaign, 0 otherwise
  • Response: 1 if customer accepted the offer in the last campaign, 0 otherwise
  • NumWebPurchases: Number of purchases made through the company’s website
  • NumCatalogPurchases: Number of purchases made using a catalogue
  • NumStorePurchases: Number of purchases made directly in stores
  • NumWebVisitsMonth: Number of visits to company’s website in the last month
In [ ]:
df = pd.read_csv('marketing_campaign.csv', sep="\t")  
In [ ]:
df.isnull().sum()
Out[ ]:
ID                      0
Year_Birth              0
Education               0
Marital_Status          0
Income                 24
Kidhome                 0
Teenhome                0
Dt_Customer             0
Recency                 0
MntWines                0
MntFruits               0
MntMeatProducts         0
MntFishProducts         0
MntSweetProducts        0
MntGoldProds            0
NumDealsPurchases       0
NumWebPurchases         0
NumCatalogPurchases     0
NumStorePurchases       0
NumWebVisitsMonth       0
AcceptedCmp3            0
AcceptedCmp4            0
AcceptedCmp5            0
AcceptedCmp1            0
AcceptedCmp2            0
Complain                0
Z_CostContact           0
Z_Revenue               0
Response                0
dtype: int64
In [ ]:
msno.matrix(df)
Out[ ]:
<Axes: >
No description has been provided for this image
In [ ]:
df = df.dropna()
In [ ]:
df.isnull().sum()
     
Out[ ]:
ID                     0
Year_Birth             0
Education              0
Marital_Status         0
Income                 0
Kidhome                0
Teenhome               0
Dt_Customer            0
Recency                0
MntWines               0
MntFruits              0
MntMeatProducts        0
MntFishProducts        0
MntSweetProducts       0
MntGoldProds           0
NumDealsPurchases      0
NumWebPurchases        0
NumCatalogPurchases    0
NumStorePurchases      0
NumWebVisitsMonth      0
AcceptedCmp3           0
AcceptedCmp4           0
AcceptedCmp5           0
AcceptedCmp1           0
AcceptedCmp2           0
Complain               0
Z_CostContact          0
Z_Revenue              0
Response               0
dtype: int64
In [ ]:
df.duplicated().sum()
Out[ ]:
0

Feature Engineering

In [ ]:
df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 2216 entries, 0 to 2239
Data columns (total 29 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   ID                   2216 non-null   int64  
 1   Year_Birth           2216 non-null   int64  
 2   Education            2216 non-null   object 
 3   Marital_Status       2216 non-null   object 
 4   Income               2216 non-null   float64
 5   Kidhome              2216 non-null   int64  
 6   Teenhome             2216 non-null   int64  
 7   Dt_Customer          2216 non-null   object 
 8   Recency              2216 non-null   int64  
 9   MntWines             2216 non-null   int64  
 10  MntFruits            2216 non-null   int64  
 11  MntMeatProducts      2216 non-null   int64  
 12  MntFishProducts      2216 non-null   int64  
 13  MntSweetProducts     2216 non-null   int64  
 14  MntGoldProds         2216 non-null   int64  
 15  NumDealsPurchases    2216 non-null   int64  
 16  NumWebPurchases      2216 non-null   int64  
 17  NumCatalogPurchases  2216 non-null   int64  
 18  NumStorePurchases    2216 non-null   int64  
 19  NumWebVisitsMonth    2216 non-null   int64  
 20  AcceptedCmp3         2216 non-null   int64  
 21  AcceptedCmp4         2216 non-null   int64  
 22  AcceptedCmp5         2216 non-null   int64  
 23  AcceptedCmp1         2216 non-null   int64  
 24  AcceptedCmp2         2216 non-null   int64  
 25  Complain             2216 non-null   int64  
 26  Z_CostContact        2216 non-null   int64  
 27  Z_Revenue            2216 non-null   int64  
 28  Response             2216 non-null   int64  
dtypes: float64(1), int64(25), object(3)
memory usage: 519.4+ KB
In [ ]:
#Dt_Customer is not datetime
df['Dt_Customer'] = pd.to_datetime(df['Dt_Customer'], dayfirst=True)
df['Dt_Customer'] = pd.to_datetime(df['Dt_Customer'], format='mixed', dayfirst=True)
In [ ]:
print("The newest customer's enrolment date in the records:", max(df['Dt_Customer']))
print("The oldest customer's enrolment date in the records:", min(df['Dt_Customer']))
The newest customer's enrolment date in the records: 2014-06-29 00:00:00
The oldest customer's enrolment date in the records: 2012-07-30 00:00:00
In [ ]:
#Extract the age of the customer
df['Age'] = 2024 - df['Year_Birth']  
In [ ]:
#Create a feature called Expenditure indicating the total amount spent by the customer in various categories over the span of two years.
df['Expenditure'] = df['MntWines'] + df['MntFruits'] + df['MntMeatProducts'] + df['MntFishProducts'] + df['MntSweetProducts'] + df['MntGoldProds']

#Create a feature called Stays_with indicating the living situation.
df['Stays_With'] = df['Marital_Status'].replace({'Married':'Partner', 'Together':'Partner', 'Absurd':'Alone', 'Widow':'Alone', 'YOLO':'Alone', 'Divorced':'Alone', 'Single':'Alone'})
  
#create a feature called offspring indicating the number of children the customer has.
df['Offspring'] = df['Kidhome'] + df['Teenhome'] 

#create a feature called family_size to find out the number of people living in the family.
df['Family_Size'] = df['Stays_With'].replace({'Alone': 1, 'Partner':2}) + df['Offspring']  

#Create a feature "Is_Parent" to indicate parenthood status
df['Is_Parent'] = np.where(df.Offspring > 0, 1, 0)  

#Segmentation of education level
df['Education'] = df['Education'].replace({'Basic':'Undergraduate', '2n Cycle':'Undergraduate', 'Graduation':'Graduate', 'Master':'Postgraduate', 'PhD':'Postgraduate'})
     
In [ ]:
#Drop the features that are not important
drop = ['Marital_Status', 'Dt_Customer', 'Z_CostContact', 'Z_Revenue', 'Year_Birth', 'ID']
df = df.drop(drop, axis=1)
     
In [ ]:
df.head()
Out[ ]:
Education Income Kidhome Teenhome Recency MntWines MntFruits MntMeatProducts MntFishProducts MntSweetProducts MntGoldProds NumDealsPurchases NumWebPurchases NumCatalogPurchases NumStorePurchases NumWebVisitsMonth AcceptedCmp3 AcceptedCmp4 AcceptedCmp5 AcceptedCmp1 AcceptedCmp2 Complain Response Age Expenditure Stays_With Offspring Family_Size Is_Parent
0 Graduate 58138.0 0 0 58 635 88 546 172 88 88 3 8 10 4 7 0 0 0 0 0 0 1 67 1617 Alone 0 1 0
1 Graduate 46344.0 1 1 38 11 1 6 2 1 6 2 1 1 2 5 0 0 0 0 0 0 0 70 27 Alone 2 3 1
2 Graduate 71613.0 0 0 26 426 49 127 111 21 42 1 8 2 10 4 0 0 0 0 0 0 0 59 776 Partner 0 2 0
3 Graduate 26646.0 1 0 26 11 4 20 10 3 5 2 2 0 4 6 0 0 0 0 0 0 0 40 53 Partner 1 3 1
4 Postgraduate 58293.0 1 0 94 173 43 118 46 27 15 5 5 3 6 5 0 0 0 0 0 0 0 43 422 Partner 1 3 1

Data Visualization

In [ ]:
df.shape
Out[ ]:
(2216, 29)
In [ ]:
df.describe()
Out[ ]:
Income Kidhome Teenhome Recency MntWines MntFruits MntMeatProducts MntFishProducts MntSweetProducts MntGoldProds NumDealsPurchases NumWebPurchases NumCatalogPurchases NumStorePurchases NumWebVisitsMonth AcceptedCmp3 AcceptedCmp4 AcceptedCmp5 AcceptedCmp1 AcceptedCmp2 Complain Response Age Expenditure Offspring Family_Size Is_Parent
count 2216.000000 2216.000000 2216.000000 2216.000000 2216.000000 2216.000000 2216.000000 2216.000000 2216.000000 2216.000000 2216.000000 2216.000000 2216.000000 2216.000000 2216.000000 2216.000000 2216.000000 2216.000000 2216.000000 2216.000000 2216.000000 2216.000000 2216.000000 2216.000000 2216.000000 2216.000000 2216.000000
mean 52247.251354 0.441787 0.505415 49.012635 305.091606 26.356047 166.995939 37.637635 27.028881 43.965253 2.323556 4.085289 2.671029 5.800993 5.319043 0.073556 0.074007 0.073105 0.064079 0.013538 0.009477 0.150271 55.179603 607.075361 0.947202 2.592509 0.714350
std 25173.076661 0.536896 0.544181 28.948352 337.327920 39.793917 224.283273 54.752082 41.072046 51.815414 1.923716 2.740951 2.926734 3.250785 2.425359 0.261106 0.261842 0.260367 0.244950 0.115588 0.096907 0.357417 11.985554 602.900476 0.749062 0.905722 0.451825
min 1730.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 28.000000 5.000000 0.000000 1.000000 0.000000
25% 35303.000000 0.000000 0.000000 24.000000 24.000000 2.000000 16.000000 3.000000 1.000000 9.000000 1.000000 2.000000 0.000000 3.000000 3.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 47.000000 69.000000 0.000000 2.000000 0.000000
50% 51381.500000 0.000000 0.000000 49.000000 174.500000 8.000000 68.000000 12.000000 8.000000 24.500000 2.000000 4.000000 2.000000 5.000000 6.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 54.000000 396.500000 1.000000 3.000000 1.000000
75% 68522.000000 1.000000 1.000000 74.000000 505.000000 33.000000 232.250000 50.000000 33.000000 56.000000 3.000000 6.000000 4.000000 8.000000 7.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 65.000000 1048.000000 1.000000 3.000000 1.000000
max 666666.000000 2.000000 2.000000 99.000000 1493.000000 199.000000 1725.000000 259.000000 262.000000 321.000000 15.000000 27.000000 28.000000 13.000000 20.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 131.000000 2525.000000 3.000000 5.000000 1.000000
In [ ]:
df.describe(include=object).T
Out[ ]:
count unique top freq
Education 2216 3 Graduate 1116
Stays_With 2216 2 Partner 1430
In [ ]:
#Plot of Expenditure , Income and Age of Customer
sns.pairplot(df , vars=['Expenditure','Income','Age'] , hue='Offspring', palette='husl')
Out[ ]:
<seaborn.axisgrid.PairGrid at 0x1f5f8307b90>
No description has been provided for this image
In [ ]:
#Plot of Expenditure and Income of Customer
plt.figure(figsize=(13,8))
sns.scatterplot(x=df[df['Income']<600000]['Expenditure'], y=df[df['Income']<600000]['Income'], color='#f060a8')
Out[ ]:
<Axes: xlabel='Expenditure', ylabel='Income'>
No description has been provided for this image
In [ ]:
#Plot of Expenditure and Age of Customer
plt.figure(figsize=(13,8))
sns.scatterplot(x=df['Expenditure'], y=df['Age'])
Out[ ]:
<Axes: xlabel='Expenditure', ylabel='Age'>
No description has been provided for this image
In [ ]:
#Plot of Expenditure  and Education of Customer
plt.figure(figsize=(13,8))
sns.histplot(x=df['Expenditure'], hue=df['Education'])
Out[ ]:
<Axes: xlabel='Expenditure', ylabel='Count'>
No description has been provided for this image
In [ ]:
#Plot of Education
df['Education'].value_counts().plot.pie(explode=[0.1,0,0], autopct='%1.1f%%', shadow=True, figsize=(8,8), colors=sns.color_palette('muted'))
     
Out[ ]:
<Axes: ylabel='count'>
No description has been provided for this image

Outlier Detection

In [ ]:
#Plot of Age of Customer
plt.figure(figsize=(13,8))
sns.distplot(df.Age, color='#1943c2')
Out[ ]:
<Axes: xlabel='Age', ylabel='Density'>
No description has been provided for this image
In [ ]:
#Plot of Income of Customer
plt.figure(figsize=(13,8))
sns.distplot(df.Income, color='#14db74')
Out[ ]:
<Axes: xlabel='Income', ylabel='Density'>
No description has been provided for this image
In [ ]:
#Plot of Age of Customer
plt.figure(figsize=(13,8))
sns.distplot(df.Expenditure, color='#db1488')
Out[ ]:
<Axes: xlabel='Expenditure', ylabel='Density'>
No description has been provided for this image
In [ ]:
#Detect outliers 
fig = make_subplots(rows=1, cols=3)

fig.add_trace(go.Box(y=df['Age'], notched=True, name='Age', marker_color = '#6699ff', 
                     boxmean=True, boxpoints='suspectedoutliers'), 1, 2)

fig.add_trace(go.Box(y=df['Income'], notched=True, name='Income', marker_color = '#ff0066', 
                     boxmean=True, boxpoints='suspectedoutliers'), 1, 1)

fig.add_trace(go.Box(y=df['Expenditure'], notched=True, name='Expenditure', marker_color = 'lightseagreen', 
                     boxmean=True, boxpoints='suspectedoutliers'), 1, 3)

fig.update_layout(title_text='Box Plots for Numerical Variables')

fig.show()
     
In [ ]:
df.head(1)
Out[ ]:
Education Income Kidhome Teenhome Recency MntWines MntFruits MntMeatProducts MntFishProducts MntSweetProducts MntGoldProds NumDealsPurchases NumWebPurchases NumCatalogPurchases NumStorePurchases NumWebVisitsMonth AcceptedCmp3 AcceptedCmp4 AcceptedCmp5 AcceptedCmp1 AcceptedCmp2 Complain Response Age Expenditure Stays_With Offspring Family_Size Is_Parent
0 Graduate 58138.0 0 0 58 635 88 546 172 88 88 3 8 10 4 7 0 0 0 0 0 0 1 67 1617 Alone 0 1 0
In [ ]:
numerical = ['Income', 'Recency', 'Age', 'Expenditure']
In [ ]:
def detect_outliers(d):
  for i in d:
    Q3, Q1 = np.percentile(df[i], [75 ,25])
    IQR = Q3 - Q1

    ul = Q3+1.5*IQR
    ll = Q1-1.5*IQR

    outliers = df[i][(df[i] > ul) | (df[i] < ll)]
    print(f'*** {i} outlier points***', '\n', outliers, '\n')
In [ ]:
detect_outliers(numerical)
*** Income outlier points*** 
 164     157243.0
617     162397.0
655     153924.0
687     160803.0
1300    157733.0
1653    157146.0
2132    156924.0
2233    666666.0
Name: Income, dtype: float64 

*** Recency outlier points*** 
 Series([], Name: Recency, dtype: int64) 

*** Age outlier points*** 
 192    124
239    131
339    125
Name: Age, dtype: int64 

*** Expenditure outlier points*** 
 1179    2525
1492    2524
1572    2525
Name: Expenditure, dtype: int64 

In [ ]:
df = df[(df['Age']<100)]
df = df[(df['Income']<600000)]
In [ ]:
df.shape
Out[ ]:
(2212, 29)
In [ ]:
categorical = [var for var in df.columns if df[var].dtype=='O']
# check the number of different labels
for var in categorical:
    print(df[var].value_counts() / float(len(df)))
    print()
    print()
Education
Graduate         0.504069
Postgraduate     0.382007
Undergraduate    0.113924
Name: count, dtype: float64


Stays_With
Partner    0.64557
Alone      0.35443
Name: count, dtype: float64


In [ ]:
categorical
Out[ ]:
['Education', 'Stays_With']
In [ ]:
df['Stays_With'].unique()
Out[ ]:
array(['Alone', 'Partner'], dtype=object)
In [ ]:
df['Education'].unique()
Out[ ]:
array(['Graduate', 'Postgraduate', 'Undergraduate'], dtype=object)
In [ ]:
df['Education'] = df['Education'].map({'Undergraduate':0,'Graduate':1, 'Postgraduate':2})
     

df['Stays_With'] = df['Stays_With'].map({'Alone':0,'Partner':1})
In [ ]:
df.dtypes
Out[ ]:
Education                int64
Income                 float64
Kidhome                  int64
Teenhome                 int64
Recency                  int64
MntWines                 int64
MntFruits                int64
MntMeatProducts          int64
MntFishProducts          int64
MntSweetProducts         int64
MntGoldProds             int64
NumDealsPurchases        int64
NumWebPurchases          int64
NumCatalogPurchases      int64
NumStorePurchases        int64
NumWebVisitsMonth        int64
AcceptedCmp3             int64
AcceptedCmp4             int64
AcceptedCmp5             int64
AcceptedCmp1             int64
AcceptedCmp2             int64
Complain                 int64
Response                 int64
Age                      int64
Expenditure              int64
Stays_With               int64
Offspring                int64
Family_Size              int64
Is_Parent                int32
dtype: object
In [ ]:
df.head(3)
Out[ ]:
Education Income Kidhome Teenhome Recency MntWines MntFruits MntMeatProducts MntFishProducts MntSweetProducts MntGoldProds NumDealsPurchases NumWebPurchases NumCatalogPurchases NumStorePurchases NumWebVisitsMonth AcceptedCmp3 AcceptedCmp4 AcceptedCmp5 AcceptedCmp1 AcceptedCmp2 Complain Response Age Expenditure Stays_With Offspring Family_Size Is_Parent
0 1 58138.0 0 0 58 635 88 546 172 88 88 3 8 10 4 7 0 0 0 0 0 0 1 67 1617 0 0 1 0
1 1 46344.0 1 1 38 11 1 6 2 1 6 2 1 1 2 5 0 0 0 0 0 0 0 70 27 0 2 3 1
2 1 71613.0 0 0 26 426 49 127 111 21 42 1 8 2 10 4 0 0 0 0 0 0 0 59 776 1 0 2 0
In [ ]:
corrmat = df.corr()

plt.figure(figsize=(20,20))  
sns.heatmap(corrmat, annot = True, cmap = 'crest', center = 0)
Out[ ]:
<Axes: >
No description has been provided for this image

Feature Scaling

In [ ]:
df_old = df.copy()
In [ ]:
# drop columns
drop_columns = ['AcceptedCmp3', 'AcceptedCmp4', 'AcceptedCmp5', 'AcceptedCmp1','AcceptedCmp2', 'Complain', 'Response']
df = df.drop(drop_columns , axis=1)
In [ ]:
scaler = StandardScaler()
df = pd.DataFrame(scaler.fit_transform(df), columns = df.columns)
In [ ]:
df.head(3)
Out[ ]:
Education Income Kidhome Teenhome Recency MntWines MntFruits MntMeatProducts MntFishProducts MntSweetProducts MntGoldProds NumDealsPurchases NumWebPurchases NumCatalogPurchases NumStorePurchases NumWebVisitsMonth Age Expenditure Stays_With Offspring Family_Size Is_Parent Clusters
0 -0.411675 0.287105 -0.822754 -0.929699 0.310353 0.977660 1.552041 1.690293 2.453472 1.483713 0.852576 0.351030 1.426865 2.503607 -0.555814 0.692181 1.018352 1.676245 -1.349603 -1.264598 -1.758359 -1.581139 -0.310687
1 -0.411675 -0.260882 1.040021 0.908097 -0.380813 -0.872618 -0.637461 -0.718230 -0.651004 -0.634019 -0.733642 -0.168701 -1.126420 -0.571340 -1.171160 -0.132545 1.274785 -0.963297 -1.349603 1.404572 0.449070 0.632456 1.801732
2 -0.411675 0.913196 -0.822754 -0.929699 -0.795514 0.357935 0.570540 -0.178542 1.339513 -0.147184 -0.037254 -0.688432 1.426865 -0.229679 1.290224 -0.544908 0.334530 0.280110 0.740959 -1.264598 -0.654644 -1.581139 0.393452

Dimensionality Reduction

In [ ]:
p = PCA(n_components=3)
p.fit(df)
Out[ ]:
PCA(n_components=3)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
PCA(n_components=3)
In [ ]:
W = p.components_.T
W
Out[ ]:
array([[ 1.30778771e-02,  1.44277528e-01, -4.70771426e-01],
       [ 2.81248858e-01,  1.77878164e-01, -8.76658983e-02],
       [-2.45692674e-01,  5.76058529e-03,  2.69015093e-01],
       [-9.37194917e-02,  4.61594829e-01, -1.34887786e-01],
       [ 3.71636597e-03,  1.63414769e-02,  2.70181007e-02],
       [ 2.55668617e-01,  1.95331827e-01, -7.14503115e-02],
       [ 2.36504204e-01, -2.18575420e-03,  2.53321278e-01],
       [ 2.83905038e-01,  3.46528569e-05,  7.86274537e-02],
       [ 2.46940306e-01, -1.14624037e-02,  2.45744737e-01],
       [ 2.35683002e-01,  9.38301189e-03,  2.52354483e-01],
       [ 1.87969033e-01,  1.10934749e-01,  2.14824588e-01],
       [-7.54951077e-02,  3.39479300e-01,  1.88189814e-01],
       [ 1.68843410e-01,  2.83371310e-01,  6.95469695e-02],
       [ 2.76613344e-01,  9.47168407e-02,  3.11027304e-02],
       [ 2.41628747e-01,  1.92147752e-01,  2.15568134e-02],
       [-2.25384885e-01,  4.40103706e-02,  1.24000647e-01],
       [ 4.07595259e-02,  2.35408023e-01, -4.01417759e-01],
       [ 3.19061284e-01,  1.18345335e-01,  6.39589304e-02],
       [-2.73771544e-02,  1.15882801e-01,  2.90603792e-01],
       [-2.44084527e-01,  3.39331796e-01,  9.47820165e-02],
       [-2.16313970e-01,  3.41811001e-01,  2.31810203e-01],
       [-2.39720715e-01,  2.89828398e-01,  9.72583728e-02],
       [ 9.92652778e-02,  2.03177111e-01, -2.34086400e-01]])
In [ ]:
pd.DataFrame(W, index=df.columns, columns=['W1','W2','W3'])
Out[ ]:
W1 W2 W3
Education 0.013078 0.144278 -0.470771
Income 0.281249 0.177878 -0.087666
Kidhome -0.245693 0.005761 0.269015
Teenhome -0.093719 0.461595 -0.134888
Recency 0.003716 0.016341 0.027018
MntWines 0.255669 0.195332 -0.071450
MntFruits 0.236504 -0.002186 0.253321
MntMeatProducts 0.283905 0.000035 0.078627
MntFishProducts 0.246940 -0.011462 0.245745
MntSweetProducts 0.235683 0.009383 0.252354
MntGoldProds 0.187969 0.110935 0.214825
NumDealsPurchases -0.075495 0.339479 0.188190
NumWebPurchases 0.168843 0.283371 0.069547
NumCatalogPurchases 0.276613 0.094717 0.031103
NumStorePurchases 0.241629 0.192148 0.021557
NumWebVisitsMonth -0.225385 0.044010 0.124001
Age 0.040760 0.235408 -0.401418
Expenditure 0.319061 0.118345 0.063959
Stays_With -0.027377 0.115883 0.290604
Offspring -0.244085 0.339332 0.094782
Family_Size -0.216314 0.341811 0.231810
Is_Parent -0.239721 0.289828 0.097258
Clusters 0.099265 0.203177 -0.234086
In [ ]:
pd.DataFrame(W, index=df.columns, columns=['W1','W2','W3'])
Out[ ]:
W1 W2 W3
Education 0.013078 0.144278 -0.470771
Income 0.281249 0.177878 -0.087666
Kidhome -0.245693 0.005761 0.269015
Teenhome -0.093719 0.461595 -0.134888
Recency 0.003716 0.016341 0.027018
MntWines 0.255669 0.195332 -0.071450
MntFruits 0.236504 -0.002186 0.253321
MntMeatProducts 0.283905 0.000035 0.078627
MntFishProducts 0.246940 -0.011462 0.245745
MntSweetProducts 0.235683 0.009383 0.252354
MntGoldProds 0.187969 0.110935 0.214825
NumDealsPurchases -0.075495 0.339479 0.188190
NumWebPurchases 0.168843 0.283371 0.069547
NumCatalogPurchases 0.276613 0.094717 0.031103
NumStorePurchases 0.241629 0.192148 0.021557
NumWebVisitsMonth -0.225385 0.044010 0.124001
Age 0.040760 0.235408 -0.401418
Expenditure 0.319061 0.118345 0.063959
Stays_With -0.027377 0.115883 0.290604
Offspring -0.244085 0.339332 0.094782
Family_Size -0.216314 0.341811 0.231810
Is_Parent -0.239721 0.289828 0.097258
Clusters 0.099265 0.203177 -0.234086
In [ ]:
p.explained_variance_
Out[ ]:
array([8.34736087, 3.0107941 , 1.4676905 ])
In [ ]:
p.explained_variance_ratio_
Out[ ]:
array([0.36276466, 0.13084491, 0.06378378])
In [ ]:
pd.DataFrame(p.explained_variance_ratio_, index=range(1,4), columns=['Explained Variability'])
Out[ ]:
Explained Variability
1 0.362765
2 0.130845
3 0.063784
In [ ]:
p.explained_variance_ratio_.cumsum()
Out[ ]:
array([0.99950708, 0.99991031, 0.99997959])
In [ ]:
sns.barplot(x = list(range(1,4)), y = p.explained_variance_, palette = 'Paired')
plt.xlabel('i')
plt.ylabel('Lambda i')
Out[ ]:
Text(0, 0.5, 'Lambda i')
No description has been provided for this image
In [ ]:
df_PCA = pd.DataFrame(p.transform(df), columns=(['col1', 'col2', 'col3']))
     
In [ ]:
df_PCA.describe().T
Out[ ]:
count mean std min 25% 50% 75% max
col1 2212.0 -1.220643e-16 2.889180 -5.972571 -2.593182 -0.747752 2.424818 7.444202
col2 2212.0 3.212219e-17 1.735164 -4.471689 -1.386241 -0.105978 1.308408 6.089454
col3 2212.0 -2.890997e-17 1.211485 -3.998819 -0.866740 -0.006152 0.828171 5.304697
In [ ]:
x = df_PCA['col1']
y = df_PCA['col2']
z = df_PCA['col3']

fig = plt.figure(figsize=(13,8))
ax = fig.add_subplot(111, projection='3d')
ax.scatter(x,y,z, c='#e64363', marker='o')
ax.set_title('A 3D Projection of Data In the Reduced Dimension')
plt.show()
     
No description has been provided for this image

Train the model using K-means clustering

In [ ]:
# Elbow Method to determine the number of clusters to be formed.
Elbow_M = KElbowVisualizer(KMeans(), k=10)
Elbow_M.fit(df_PCA)
Elbow_M.show()
No description has been provided for this image
Out[ ]:
<Axes: title={'center': 'Distortion Score Elbow for KMeans Clustering'}, xlabel='k', ylabel='distortion score'>
In [ ]:
AC = AgglomerativeClustering(n_clusters=4)
# fit model and predict clusters
yhat_AC = AC.fit_predict(df_PCA)
df_PCA['Clusters'] = yhat_AC
#Adding the Clusters feature to the orignal dataframe.
df['Clusters'] = yhat_AC
df_old['Clusters'] = yhat_AC
In [ ]:
fig = plt.figure(figsize=(13,8))
ax = plt.subplot(111, projection='3d', label='bla')
ax.scatter(x, y, z, s=40, c=df_PCA['Clusters'], marker='o', cmap='flare')
ax.set_title('Clusters')
plt.show()
No description has been provided for this image
In [ ]:
plt.figure(figsize=(13,8))
pl = sns.countplot(x=df['Clusters'], palette= "viridis")
pl.set_title('Distribution Of The Clusters')
plt.show()
No description has been provided for this image
In [ ]:
plt.figure(figsize=(13,8))
pl = sns.scatterplot(data=df_old, x=df_old['Expenditure'], y=df_old['Income'], hue=df_old['Clusters'], palette= "magma")
pl.set_title("Cluster's Profile Based on Income and Spending")
plt.legend()
Out[ ]:
<matplotlib.legend.Legend at 0x1f60835fb90>
No description has been provided for this image

Cluster Patterns

  • group 0 is indicated in black and have average expenditure and average income
  • group 1 is indicated in purple and have low expenditure and low income
  • group 2 is indicated in orange and have high expenditure and average income
  • group 3 is indicated in yellow and have high expenditure and high income
In [ ]:
plt.figure(figsize=(13,8))
pl = sns.swarmplot(x=df_old['Clusters'], y=df_old['Expenditure'], color="#CBEDDD", alpha=0.7)
pl = sns.boxenplot(x=df_old['Clusters'], y=df_old['Expenditure'], palette= "rocket_r")
plt.show()
No description has been provided for this image
In [ ]:
df_old['Total_Promos'] = df_old['AcceptedCmp1']+ df_old['AcceptedCmp2']+ df_old['AcceptedCmp3']+ df_old['AcceptedCmp4']+ df_old['AcceptedCmp5']

plt.figure(figsize=(13,8))
pl = sns.countplot(x=df_old['Total_Promos'], hue=df_old['Clusters'], palette= "viridis")
pl.set_title('Count Of Promotion Accepted')
pl.set_xlabel('Number Of Total Accepted Promotions')
plt.legend(loc='upper right')
plt.show()
No description has been provided for this image
In [ ]:
plt.figure(figsize=(13,8))
pl=sns.boxenplot(y=df_old['NumDealsPurchases'],x=df_old['Clusters'],  palette= "viridis")
pl.set_title('Number of Deals Purchased')
Out[ ]:
Text(0.5, 1.0, 'Number of Deals Purchased')
No description has been provided for this image
In [ ]:
plt.figure(figsize=(13,8))
sns.jointplot(x=df_old['Kidhome'], y=df_old['Expenditure'], hue=df_old['Clusters'], kind='kde', palette= "magma")
Out[ ]:
<seaborn.axisgrid.JointGrid at 0x1f609ed07d0>
<Figure size 1300x800 with 0 Axes>
No description has been provided for this image
In [ ]:
plt.figure(figsize=(13,8))
sns.jointplot(x=df_old['Stays_With'], y=df_old['Expenditure'], hue=df_old['Clusters'], kind='kde', palette= "magma")
Out[ ]:
<seaborn.axisgrid.JointGrid at 0x1f60c14c790>
<Figure size 1300x800 with 0 Axes>
No description has been provided for this image
In [ ]:
plt.figure(figsize=(13,8))
sns.jointplot(x=df_old['Education'], y=df_old['Expenditure'], hue=df_old['Clusters'], kind='kde', palette= "magma")
Out[ ]:
<seaborn.axisgrid.JointGrid at 0x1f60c6ccd10>
<Figure size 1300x800 with 0 Axes>
No description has been provided for this image
In [ ]:
plt.figure(figsize=(13,8))
sns.jointplot(x=df_old['Is_Parent'], y=df_old['Expenditure'], hue=df_old['Clusters'], kind='kde', palette= "magma")
Out[ ]:
<seaborn.axisgrid.JointGrid at 0x1f608e9c590>
<Figure size 1300x800 with 0 Axes>
No description has been provided for this image
In [ ]:
plt.figure(figsize=(13,8))
sns.jointplot(x=df_old['Family_Size'], y=df_old['Expenditure'], hue=df_old['Clusters'], kind='kde', palette= "magma")
Out[ ]:
<seaborn.axisgrid.JointGrid at 0x1f609ed7ed0>
<Figure size 1300x800 with 0 Axes>
No description has been provided for this image
In [ ]:
plt.figure(figsize=(13,8))
sns.jointplot(x=df_old['Teenhome'], y=df_old['Expenditure'], hue=df_old['Clusters'], kind='kde', palette= "magma")
Out[ ]:
<seaborn.axisgrid.JointGrid at 0x1f609dfe350>
<Figure size 1300x800 with 0 Axes>
No description has been provided for this image
In [ ]:
plt.figure(figsize=(13,8))
sns.jointplot(x=df_old['Age'], y=df_old['Expenditure'], hue=df_old['Clusters'], kind='kde', palette= "magma")
Out[ ]:
<seaborn.axisgrid.JointGrid at 0x1f609c586d0>
<Figure size 1300x800 with 0 Axes>
No description has been provided for this image
In [ ]:
plt.figure(figsize=(13,8))
sns.jointplot(x=df_old['Offspring'], y=df_old['Expenditure'], hue=df_old['Clusters'], kind='kde', palette= "magma")
Out[ ]:
<seaborn.axisgrid.JointGrid at 0x1f5fd2fca50>
<Figure size 1300x800 with 0 Axes>
No description has been provided for this image

Inference:

  • Cluster 0 has customers who are mostly parents with atleast 2 kids , with around five members in the family. Indicated by black

  • Cluster 1 has customers who are mostly parents with one or no kid , with around of three members in the family. Indicated by purple.

  • Cluster 2 has customers who are mostly parents with maximum of two kids , with around of four members in the family. Indicated by orange.

  • Cluster 3 has customers not parents , with around of two members in the family. Indicated by yellow.